Predicting Mental Health Problems from Social Determinants and Caregiving Activities of Caregivers of Persons with Dementia: A Machine Learning Approach

BMIN503/EPID600 Final Project

Author

Hannah Cho


1 Overview

In the United States, as of 2021, approximately 34 million informal caregivers were providing unpaid support to persons living with dementia (PLWD). As dementia progresses to its end-stage or terminal phase, individuals with the condition often become fully dependent on caregivers for assistance with most daily activities, such as bathing, repositioning, and other essential personal care needs. Notably, 80% of PLWD receive care from informal caregivers—primarily spouses, adult children, and close friends—within community settings rather than institutionalized environments. This project addressed public health issues in dementia caregiving that directly affect caregivers’ mental health, such as depression and anxiety, by identifying the caregiving activities and sociodemographic features that best predict which caregivers are at high risk.

This project was conducted after consultations with Dr. George Demiris, a recognized expert in the field of dementia care and caregiving, and Dr. Huang, an experienced biostatistician. Their contributions provided critical insights into both the substantive and methodological aspects of the research, ensuring a rigorous and well-informed approach to addressing the challenges faced by caregivers of PLWD.

2 Introduction

In the United States, as of 2021, 34 million informal caregivers were providing unpaid support to persons living with dementia (PLWD; Alzheimer’s Association, 2023; McCabe et al., 2016; Reinhard et al., 2023). As PLWD progress to the end-stage or terminal stage of dementia, they often become dependent on assistance for most daily activities, including bathing and changing position. Of all PLWD, 80% receive care from individuals—often spouses, adult children, and close friends—in the community (Alzheimer’s Association, 2023).

Dementia caregivers face profoundly complex and multifaceted challenges that extend beyond the physical and emotional demands of caregiving. These challenges encompass physical tasks, such as providing continuous care, assisting with activities of daily living, and managing the multifarious health complications associated with dementia. However, the toll of caregiving is not solely physical; it also places a considerable emotional burden on caregivers. Many experience elevated levels of stress, anxiety, depression, and isolation, as the exhaustive nature of caregiving leaves little opportunity for self-care or social engagement.

Caregivers for persons living with dementia (PLWD) are particularly vulnerable to anxiety and depression due to the progressive and unpredictable trajectory of the disease. This trajectory often varies significantly based on individuals’ pre-existing conditions and comorbidities, adding to caregivers’ uncertainty and emotional strain.

Beyond these personal and emotional burdens, caregivers frequently encounter substantial social and systemic barriers that hinder their ability to access necessary support. These barriers include financial strain, a lack of accessible and affordable respite care, limited awareness of available resources, and cultural or social stigmas surrounding caregiving. Furthermore, caregivers often experience isolation due to diminished social engagement, compounding the psychological toll.

This study aims to explore how specific caregiving activities, individual characteristics, and sociodemographic factors predict the likelihood of anxiety and depression in dementia caregivers. Additionally, it investigates how effectively machine learning models can forecast these mental health outcomes, offering a novel approach to identifying caregivers at greatest risk and informing targeted interventions.

Problem Statement: Dementia caregiving is a demanding role, often accompanied by significant psychological challenges, including anxiety and depression. However, the specific caregiving activities, caregiver characteristics, and sociodemographic factors that most strongly predict these mental health outcomes remain insufficiently understood. Furthermore, while traditional statistical approaches have provided valuable insights, the potential of machine learning models to accurately forecast anxiety and depression in dementia caregivers remains underexplored. This research aims to identify the key predictors of these mental health outcomes and evaluate the predictive accuracy of machine learning algorithms in supporting early identification and targeted interventions for at-risk caregivers.

Research Question: Which caregiving activities, caregiver characteristics, and sociodemographic factors best predict anxiety and depression in dementia caregivers, and how accurately can machine learning models forecast these mental health outcomes to support early identification of at-risk caregivers?

3 Methods

Dataset: This study used data from the National Health and Aging Trends Study (NHATS) Round 11 and the National Study of Caregiving (NSOC) Round 4, which include data collected in 2021. The NHATS is a publicly accessible dataset that includes a nationally representative sample of adults aged 65 years and older who are Medicare beneficiaries in the United States of America. The NHATS began in 2011 and included 8,245 participants who have been followed up annually since then; the study goal is to encourage research to maximize health and enhance the quality of life of older adults. The NSOC is conducted alongside the NHATS; participants in the NSOC are caregivers for older adults included in the NHATS. Both the NHATS and the NSOC were funded by the National Institute on Aging (R01AG062477; U01AG032947). When used together, the NHATS and NSOC provide valuable information on dyads of older adults receiving care and their family caregivers.

Samples: Persons living with dementia: Probable dementia was identified based on one of the following criteria: a self-reported diagnosis of dementia or Alzheimer’s disease by a physician, a score of 2 or higher on the AD8 screening instrument administered to proxy respondents, or a score 1.5 standard deviations or more below the mean on a range of cognitive tests. Caregivers: Caregivers were identified from the NSOC and NHATS datasets. Because this project specifically aims to explore caregivers of persons with dementia in the community, the sample was further filtered by dementia classification (demclas) and residency (r11dresid).
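The 1.5-standard-deviation criterion can be illustrated with a small sketch; the scores below are simulated for illustration only, and the actual NHATS classification applies pre-computed, domain-specific cutpoints rather than a cutoff recomputed from the sample:

```r
# Flag simulated cognitive scores falling more than 1.5 SD below the mean
set.seed(42)
cog_score <- c(rnorm(95, mean = 10, sd = 2), rnorm(5, mean = 4, sd = 1))
cutoff <- mean(cog_score) - 1.5 * sd(cog_score)
below_cutoff <- cog_score < cutoff
table(below_cutoff)
```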

After retrieving NHATS Round 11 and NSOC Round 4, I selected the sample (from NHATS Round 11: r11demclas) and then merged the necessary datasets.

  • As the purpose of this study was to examine family caregivers who provide care to persons living with dementia at home, we selected those who provided care to persons living with dementia and lived at home using the Dementia Classification with Programming Statements provided by the NHATS.

  • The final sample included 536 dyads.

3.1 Loading Packages and Importing Datasets

#Loading necessary packages first.
library(haven) #dta file
library(dplyr) #data cleaning

library(ggplot2) #data visualization 
library(gtsummary) #summary statistics

#Bring datasets 
df1 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NHATS_Round_11_SP_File_V2.dta") # dementia classification in this file
df2 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NSOC_r11.dta") #caregiver information 1
df3 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NSOC_cross.dta") #caregiver information 2
df4 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NHATS_Round_11_OP_File.dta") #older adults information 

3.2 Choosing Probable and Possible Dementia Participants

#need to clean df1 first in order to classify dementia classes  
#ENTER WHICH ROUND?
sp1 <- df1 |>
  mutate(rnd = 11) 

#3. EDIT ROUND NUMBER INSIDE THE QUOTES 
#(THIS REMOVES THE PREFIXES ON NEEDED VARIABLES ) 
sp1 <- sp1 |>
  rename_all(~stringr::str_replace(.,"^r11","")) |>
  rename_all(~stringr::str_replace(.,"^hc11","")) |>
  rename_all(~stringr::str_replace(.,"^is11","")) |>
  rename_all(~stringr::str_replace(.,"^cp11","")) |> 
  rename_all(~stringr::str_replace(.,"^cg11",""))

#ADD R1DAD8DEM AND SET TO -1 FOR ROUND 1 BECAUSE THERE IS NO PRIOR DIAGNOSIS IN R1
sp1 <- sp1 |>
  mutate(dad8dem = ifelse(rnd == 1, -1, dad8dem))


#SUBSET NEEDED VARIABLES
df<-sp1 |> 
  dplyr::select(spid, rnd, dresid, resptype, disescn9, chgthink1, chgthink2, chgthink3, chgthink4, chgthink5, chgthink6, chgthink7, chgthink8, dad8dem,
                speaktosp, todaydat1, todaydat2, todaydat3, todaydat4, todaydat5, presidna1, presidna3, vpname1, vpname3, quesremem, dclkdraw, atdrwclck, 
                dwrdimmrc, dwrdlstnm, dwrddlyrc)

#FIX A ROUND 2 CODING ERROR#
df <- df |>
  mutate(dwrdimmrc = ifelse(dwrdimmrc==10 & dwrddlyrc==-3 & rnd==2, -3, dwrdimmrc))

#CREATE SELECTED ROUND DEMENTIA CLASSIFICATION VARIABLE 
df <- df |>
  mutate(demclas = ifelse(dresid==3 | dresid==5 | dresid==7, -9,      #SET MISSING (RESIDENTIAL CARE FQ ONLY)
                          ifelse((dresid==4 & rnd==1) | dresid==6 | dresid==8, -1,   #SET N.A. (NURSING HOME RESIDENTS, DECEASED)
                                 ifelse((disescn9==1 | disescn9==7) &                #CODE PROBABLE IF DEMENTIA DIAGNOSIS REPORTED BY SELF OR PROXY*
                                          (resptype==1 | resptype==2), 1, NA))))

#CODE AD8_SCORE*
#INITIALIZE COUNTS TO NOT APPLICABLE*
#ASSIGN VALUES TO AD8 ITEMS IF PROXY AND DEMENTIA CLASS NOT ALREADY ASSIGNED BY REPORTED DIAGNOSIS 
for(i in 1:8){
  df[[paste("ad8_", i, sep = "")]]  <- as.numeric(ifelse(df[[paste("chgthink", i, sep = "")]]==2 & df$resptype==2 & is.na(df$demclas), 0, #PROXY REPORTS NO CHANGE
                                                         ifelse((df[[paste("chgthink", i, sep = "")]]==1 | df[[paste("chgthink", i, sep = "")]] == 3) & df$resptype==2 & is.na(df$demclas), 1, #PROXY REPORTS A CHANGE OR ALZ/DEMENTIA*
                                                                ifelse(df$resptype==2 & is.na(df$demclas), NA, -1))))    #SET TO NA IF IN RES CARE AND demclass=., OTHERWISE AD8 ITEM IS SET TO NOT APPLICABLE                                                                                                                        
}

#INITIALIZE COUNTS TO NOT APPLICABLE*
for(i in 1:8){
  df[[paste("ad8miss_", i, sep = "")]]  <- as.numeric(ifelse(is.na(df[[paste("ad8_", i, sep = "")]]), 1,
                                                             ifelse((df[[paste("ad8_", i, sep = "")]]==0 | df[[paste("ad8_", i, sep = "")]]==1) & df$resptype==2 & is.na(df$demclas), 0, -1)))
}

for(i in 1:8){
  df[[paste("ad8_", i, sep = "")]] <- as.numeric(ifelse(is.na(df[[paste("ad8_", i, sep = "")]]) & is.na(df$demclas) & df$resptype==2, 0, df[[paste("ad8_", i, sep = "")]]))
}

#COUNT AD8 ITEMS
#ROUNDS 2+
df <- df |>
  mutate(ad8_score = ifelse(resptype==2 & is.na(demclas), (ad8_1 + ad8_2 + ad8_3 + ad8_4 + ad8_5 + ad8_6 + ad8_7 + ad8_8), -1)) %>% 
  #SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE=8 
  mutate(ad8_score = ifelse(dad8dem==1 & resptype==2 & is.na(demclas), 8, ad8_score))  %>% 
  #SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE=8 FOR ROUNDS 4-9
  mutate(ad8_score = ifelse(resptype==2 & dad8dem==-1 & chgthink1==-1 & (rnd>=4 & rnd<=9) & is.na(demclas) , 8, ad8_score)) 

#COUNT MISSING AD8 ITEMS
df <- df |> 
  mutate(ad8_miss = ifelse(resptype==2 & is.na(demclas),(ad8miss_1+ad8miss_2+ad8miss_3+ad8miss_4+ad8miss_5+ad8miss_6+ad8miss_7+ad8miss_8), -1))

#CODE AD8 DEMENTIA CLASS 
#IF SCORE>=2 THEN MEETS AD8 CRITERIA
#IF SCORE IS 0 OR 1 THEN DOES NOT MEET AD8 CRITERIA
df <- df |> 
  mutate(ad8_dem = ifelse(ad8_score>=2, 1,
                          ifelse(ad8_score==0 | ad8_score==1 | ad8_miss==8, 2, NA)))

#UPDATE DEMENTIA CLASSIFICATION VARIABLE WITH AD8 CLASS
df <- df |> 
  #PROBABLE DEMENTIA BASED ON AD8 SCORE  
  mutate(demclas = ifelse(ad8_dem==1 & is.na(demclas), 1, 
                          #NO DIAGNOSIS, DOES NOT MEET AD8 CRITERION, AND PROXY SAYS CANNOT ASK SP COGNITIVE ITEMS*
                          ifelse(ad8_dem==2 & speaktosp==2 & is.na(demclas), 3, demclas)))


####CODE DATE ITEMS AND COUNT 
#CODE ONLY YES/NO RESPONSES: MISSING/NA CODES -1, -9 LEFT MISSING*
#2: NO/DK OR -7: REFUSED RECODED TO : NO/DK/RF*
#NOTE: ROUND 4 FIELDED THE FOURTH DATE ITEM AS TODAYDAT5, SO DATE_ITEM4 IS REPLACED WITH DATE_ITEM5 BELOW FOR ROUND 4
for(i in 1:5){
  df[[paste("date_item", i, sep = "")]]  <- as.numeric(ifelse(df[[paste("todaydat", i, sep = "")]]==1, 1,
                                                              ifelse(df[[paste("todaydat", i, sep = "")]]==2 | df[[paste("todaydat", i, sep = "")]]== -7, 0, NA)))
}

#COUNT CORRECT DATE ITEMS
df <- df |>
  mutate(date_item4 = ifelse(rnd==4, date_item5, date_item4)) %>% 
  mutate(date_sum = date_item1 + date_item2 + date_item3 + date_item4) %>% 
  
  #PROXY SAYS CAN'T SPEAK TO SP
  mutate(date_sum = ifelse(speaktosp==2 & is.na(date_sum),-2,  
                           #PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER*
                           ifelse((is.na(date_item1) | is.na(date_item2) | is.na(date_item3) | is.na(date_item4)) & speaktosp==1,-3, date_sum))) %>% 
  
  #MISSING IF PROXY SAYS CAN'T SPEAK TO SP*  
  mutate(date_sumr = ifelse(date_sum == -2 , NA, 
                            #0 IF SP UNABLE TO ANSWER*
                            ifelse(date_sum == -3 , 0, date_sum)))


########PRESIDENT AND VICE PRESIDENT NAME ITEMS AND COUNT########## 
##CODE ONLY YES/NO RESPONSES: MISSING/N.A. CODES -1,-9 LEFT MISSING *
##2:NO/DK OR -7:REFUSED RECODED TO 0:NO/DK/RF*
df <- df |>
  mutate(preslast = ifelse(presidna1 == 1, 1,
                           ifelse(presidna1 == 2 | presidna1 == -7, 0, NA))) |> 
  mutate(presfirst = ifelse(presidna3 == 1, 1,
                            ifelse(presidna3 == 2 | presidna3 == -7, 0, NA))) |> 
  mutate(vplast = ifelse(vpname1 == 1, 1,
                         ifelse(vpname1 == 2 | vpname1 == -7, 0, NA))) |> 
  mutate(vpfirst = ifelse(vpname3 == 1, 1,
                          ifelse(vpname3 == 2 | vpname3 == -7, 0, NA))) |> 
  
  #COUNT CORRECT PRESIDENT/VP NAME ITEMS*
  mutate(presvp = preslast + presfirst + vplast + vpfirst) |> 
  #PROXY SAYS CAN'T SPEAK TO SP 
  mutate(presvp = ifelse(speaktosp == 2 & is.na(presvp), -2, 
                         #PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER                           
                         ifelse((is.na(preslast) | is.na(presfirst) | is.na(vplast) | is.na(vpfirst)) & speaktosp==1 & is.na(presvp),-3, presvp))) |> 
  
  #MISSING IF PROXY SAYS CAN’T SPEAK TO SP*
  mutate(presvpr =  ifelse(presvp == -2 , NA, 
                           ifelse(presvp == -3 , 0, presvp))) |> 
  
  #ORIENTATION DOMAIN: SUM OF DATE RECALL AND PRESIDENT/VP NAMING* 
  mutate(date_prvp = date_sumr + presvpr)


#######EXECUTIVE FUNCTION DOMAIN: CLOCK DRAWING SCORE##########
#RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUND 10 ONLY)* 
df <- df |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd==10, -2,
                           ifelse(speaktosp==1 & (quesremem==2 | quesremem==-7 | quesremem==-8) & dclkdraw==-9 & rnd==10, -3,
                                  ifelse(atdrwclck==2 & dclkdraw==-9 & rnd==10, -4,
                                         ifelse(atdrwclck==97 & dclkdraw==-9 & rnd==10, -7, dclkdraw)))))

#RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUNDS 11 AND FORWARD ONLY)* 
df<-df  |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd>=11, -2, 
                           ifelse(speaktosp == 1 & (quesremem == 2 | quesremem == -7 | quesremem == -8) & dclkdraw == -9 & rnd>=11, -3, dclkdraw))) 
df<-df  |>
  mutate(clock_scorer = ifelse(dclkdraw == -3 | dclkdraw == -4 | dclkdraw == -7, 0,
                               #IMPUTE MEAN SCORE TO PERSONS MISSING A CLOCK*
                               #IF PROXY SAID CAN ASK SP*
                               ifelse(dclkdraw == -9 & speaktosp == 1, 2, 
                                      #IF SELF-RESPONDENT*       
                                      ifelse(dclkdraw == -9 & speaktosp == -1, 3, 
                                             ifelse(dclkdraw == -2 | dclkdraw == -9, NA, dclkdraw)))))


#MEMORY DOMAIN: IMMEDIATE AND DELAYED WORD RECALL 
df <- df |>
  mutate(irecall  =  ifelse(dwrdimmrc == -2 | dwrdimmrc == -1, NA,
                            ifelse(dwrdimmrc == -7 | dwrdimmrc == -3, 0, dwrdimmrc))) |> 
  mutate(irecall = ifelse(rnd==5 & dwrddlyrc==-9, NA, irecall)) |>  #round 5 only: set cases with missing word list and not previously assigned to missing
  
  mutate(drecall  =  ifelse(dwrddlyrc == -2 | dwrddlyrc == -1, NA,
                            ifelse(dwrddlyrc == -7 | dwrddlyrc == -3, 0, dwrddlyrc))) |> 
  mutate(drecall = ifelse(rnd==5 & dwrddlyrc==-9, NA, drecall)) |>  #round 5 only: set cases with missing word list and not previously assigned to missing
  
  mutate(wordrecall0_20 = irecall+drecall)


#CREATE COGNITIVE DOMAINS FOR ALL ELIGIBLE 

df<-df |> 
  mutate(clock65 = ifelse(clock_scorer == 0 | clock_scorer==1, 1, 
                          ifelse(clock_scorer > 1 & clock_scorer<6, 0, NA)))

df<-df |>  
  mutate(word65 = ifelse(wordrecall0_20 >= 0 & wordrecall0_20 <=3, 1, 
                         ifelse(wordrecall0_20 > 3 & wordrecall0_20 <=20, 0, NA)))

df<-df |>  
  mutate(datena65 = ifelse(date_prvp >= 0 & date_prvp <=3, 1, 
                           ifelse(date_prvp > 3 & date_prvp <= 8, 0, NA)))

#  *CREATE COGNITIVE DOMAIN SCORE*
df<-df |> 
  mutate(domain65 = clock65+word65+datena65)

#*SET CASES WITH MISSING WORD LIST AND NOT PREVIOUSLY ASSIGNED TO MISSING (ROUND 5 ONLY)
df<-df |>   
  mutate(demclas = ifelse(rnd==5 & dwrdlstnm==-9 & is.na(demclas), -9, demclas))

#UPDATE COGNITIVE CLASSIFICATION*
df<-df |> 
  #PROBABLE DEMENTIA
  mutate(demclas = ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & (domain65==2 | domain65==3), 1,
                          #POSSIBLE DEMENTIA
                          ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & domain65==1, 2,
                                 #NO DEMENTIA
                                 ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & domain65==0, 3, demclas))))

#KEEP VARIABLES AND SAVE DATA
df<-df |> 
  dplyr::select(spid, rnd, demclas)

#CHANGE # AFTER "r" TO THE ROUND OF INTEREST
r11demclas <- df

#4. NAME AND SAVE DEMENTIA DATA FILE:
#CHANGE # AFTER "r" TO THE ROUND OF INTEREST
#NOTE: save() writes an R data file, so an .RData extension is used (haven::write_dta() would be needed to export a Stata .dta file)
save(r11demclas, file = "~/R  HC/BMIN503_Final_Project/final final/NHATS_r11.RData") 

Once the dementia classification (demclas) is identified, it is saved as the dataset ‘r11demclas’.

3.3 Merging Datasets

#merged datasets (md). 
md1 <- left_join(df, df1, by = "spid")
md2 <- left_join(md1, df3, by = "spid")

# choose probable/possible dementia participants who live at home
dementia1 <- md2 |>
  filter(demclas %in% c(1, 2) & r11dresid %in% c(1))

dementia2 <- md2 |>
  filter(demclas %in% c(1, 2) & r11dresid %in% c(1, 2))

3.4 Preliminary Table Manipulation

Before creating the subset of the dataset for analysis, I will recode the variables. After reviewing the literature on dementia caregivers and social determinants of health, I selected the following variables.

Predictors: Caregiver-level factors include caregivers’ age, race, gender, self-reported income, and highest education level, each recoded accordingly. Education level was categorized as “Less than high school (0)”, “High school (1)”, and “College or above (2).” For economic status, the caregivers’ reported income from the previous year was used. This study included both informal and formal support as part of the caregivers’ social determinants of health. Informal support included having friends or family (a) to talk to about important life matters, (b) to help with daily activities, such as running errands, and (c) to assist with care provision. Formal support included (a) participation in a support group for caregivers, (b) access to respite services that allowed the caregiver to take time off, and (c) involvement in a training program that helped the caregiver provide care for the care recipient. Each support item was answered by indicating whether or not the caregiver received that type of support.
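Each support item can be collapsed to a binary indicator with a single across() call. A minimal sketch with hypothetical column names (support_talk, support_errands, support_care; the actual NSOC variable names differ), assuming 1 = yes, 2 = no, and negative reserve codes for missing:

```r
library(dplyr)
library(tibble)

# Hypothetical yes/no support items (1 = yes, 2 = no, negatives = missing)
support_df <- tibble(
  support_talk    = c(1, 2, 1, -8),
  support_errands = c(2, 2, 1, 1),
  support_care    = c(1, -7, 2, 1)
)

# Recode to binary indicators, keeping the originals alongside
support_df <- support_df |>
  mutate(across(
    everything(),
    ~ case_when(. == 1 ~ 1, . == 2 ~ 0, TRUE ~ NA_real_),
    .names = "{.col}_recoded"
  ))
```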

# Caregiver's Age #chd11dage

#Race
# Recode race/ethnicity into four categories:
# 0 = White non-Hispanic, 1 = Black non-Hispanic, 2 = Other, 3 = Hispanic
dementia1 <- dementia1 |>
  mutate(
    race_recode = case_when(
      crl11dcgracehisp == 1 ~ 0,  # White, non-Hispanic
      crl11dcgracehisp == 2 ~ 1,  # Black, non-Hispanic
      crl11dcgracehisp == 3 ~ 2,  # Other
      crl11dcgracehisp == 4 ~ 3,  # Hispanic
      crl11dcgracehisp %in% c(5, 6) ~ NA_real_, # Missing or not ascertained
      TRUE ~ NA_real_             # Unhandled cases
    ))  

table(dementia1$crl11dcgracehisp)

  1   2   3   4   5   6 
250 193  16  46   1  22 
table(dementia1$race_recode)

  0   1   2   3 
250 193  16  46 
# Gender: Male as reference (0), Female as 1
dementia1 <- dementia1 |>
  mutate(
    gender_recode = case_when(
      as.character(c11gender) == "1" ~ 0,  # Male
      as.character(c11gender) == "2" ~ 1,  # Female
      TRUE ~ NA_real_  # Handle any other unexpected cases
    )
  )

# Education: Recoding education levels into three categories
table(dementia1$chd11educ)

 -8  -7  -6   1   2   3   4   5   6   7   8   9 
 13   1   2   1   5  36 132  35 101  50  83  69 
dementia1 <- dementia1 |>
  mutate(
    edu_recode = case_when(
      chd11educ %in% c(1, 2, 3) ~ 1,       # High school diploma or below
      chd11educ %in% c(4, 5) ~ 1,          # Some college
      chd11educ %in% c(6, 7, 8) ~ 2,       # College and beyond
      chd11educ == 9 ~ 3,  
      chd11educ %in% c(-8, -7, -6) ~ NA_real_, # Missing or not applicable
      TRUE ~ NA_real_                      # Unhandled cases
    )
  )
table(dementia1$edu_recode)

  1   2   3 
209 234  69 
# Marital Status: Recoding marital status into binary (married vs. not married)
dementia1 <- dementia1 |>
  mutate(
    martstat_recode = case_when(
      chd11martstat == 1 ~ 0,     # Married
      chd11martstat %in% 2:6 ~ 1, # Not married (single, divorced, etc.)
      chd11martstat %in% c(-8, -6) ~ NA_real_, # Missing or not applicable
      TRUE ~ NA_real_                      # Unhandled cases
    )
  )

table(dementia1$martstat_recode)

  0   1 
204 232 
# Summary and Visualization of 'demclas'
table(dementia1$demclas) #1 probable dementia #2 possible dementia

  1   2 
492 287 
# Adding labels for clarity
dementia1$demclas_label <- factor(
  dementia1$demclas, 
  levels = c(1, 2), 
  labels = c("Probable Dementia", "Possible Dementia")
)

# Plotting with the updated labels (geom_bar counts categories directly)
ggplot(dementia1, aes(x = demclas_label)) + 
  geom_bar(
    color = "blue", 
    fill = "blue", 
    alpha = 0.7
  ) +
  labs(
    title = "Distribution of Dementia Classifications",
    x = "Dementia Class",
    y = "Count",
    fill = "Dementia Type"
  ) +
  theme_minimal()

library(dplyr)
library(tibble)
# preparing dataset 
final_dementia <- dementia1 |> 
  mutate(
    race_recode = case_when(
      crl11dcgracehisp == 1 ~ "White, non-Hispanic",
      crl11dcgracehisp == 2 ~ "Black, non-Hispanic",
      crl11dcgracehisp == 3 ~ "Other",
      crl11dcgracehisp == 4 ~ "Hispanic",
      TRUE ~ NA_character_
    ),
    gender_recode = case_when(
      as.character(c11gender) == "1" ~ "Male",
      as.character(c11gender) == "2" ~ "Female",
      TRUE ~ NA_character_
    ),
    edu_recode = case_when(
      chd11educ %in% c(1, 2, 3) ~ "Below high school",
      chd11educ %in% c(4, 5) ~ "Some college",
      chd11educ %in% c(6, 7, 8) ~ "College and beyond",
      chd11educ == 9 ~ "Unknown",
      TRUE ~ NA_character_
    ),
    martstat_recode = case_when(
      chd11martstat == 1 ~ "Married",
      chd11martstat %in% 2:6 ~ "Not married",
      TRUE ~ NA_character_
    )
  )

Caregivers’ caregiving activities

#recoding: 1 (every day), 2 (most days), 3 (some days) -> 1; 4 (rarely), 5 (never) -> 0
#C11 CA1 HOW OFT HELP WITH CHORES (cca11hwoftchs)
#C11 CA2 HOW OFTEN SHOPPED FOR SP (cca11hwoftshp)
#C11 CA6 HOW OFT HELP PERS CARE (cca11hwoftpc)
#C11 CA6B1 HELP CARE FOR TEETH (PREVIOUSLY CA11F)
#C11 CA7 HOW OFT HLP GTNG ARD HOME (cca11hwofthom)
#C11 CA9 HOW OFTEN DROVE SP (cca11hwoftdrv)
#C11 CA10 OFTN WENT ON OTH TRANSPR (cca11hwoftott)

library(dplyr)
# The six frequency items share one recoding rule, so a single across()
# with a .names suffix replaces six identical case_when blocks
dementia1 <- dementia1 |>
  mutate(across(
    c(cca11hwoftchs, cca11hwoftshp, cca11hwoftpc, 
      cca11hwofthom, cca11hwoftdrv, cca11hwoftott), 
    ~ case_when(
        . %in% c(1, 2, 3) ~ 1,
        . %in% c(4, 5) ~ 0,
        TRUE ~ NA_real_
      ),
    .names = "{.col}_recoded"
  ))
  

#C11 CA3 EVER HELP ORDER MEDICINES (cca11hlpordmd) - yes 1 no 2-->0
#C11 CA5A GO ONLINE billing or banking (cca11hlpbnkng)
#C11 CA5A GO ONLINE SHOP FR GROC (cca11cmpgrcry)
#C11 CA5B GO ONLINE ORDER MEDS (cca11cmpordrx )
#C11 CA5C GO ONLINE BILLS BANKNG (cca11cmpbnkng )
#C11 CA6B1 HELP CARE FOR TEETH (PREVIOUSLY CA11F) (cca11hlpteeth) 
#cdc11hlpyrpls 

dementia1 <- dementia1 |>
  mutate(across(
    c(cca11hlpordmd, cca11hlpbnkng, cca11cmpgrcry, 
      cca11cmpordrx, cca11cmpbnkng, cca11hlpteeth,cdc11hlpyrpls),
    ~ case_when(
        . == 1 ~ 1,
        . == 2 ~ 0,
        . %in% c(-8, -6, -1) ~ NA_real_,
        TRUE ~ NA_real_
      ),
    .names = "{.col}_recoded"
  ))

#caregiver’s caregiving activities
#cca11hwoftchs #C11 CA1 HOW OFT HELP WITH CHORES
#cca11hwoftshp #C11 CA2 HOW OFTEN SHOPPED FOR SP
#cca11hwoftpc  #C11 CA6 HOW OFT HELP PERS CARE
#cca11hwofthom #C11 CA7 HOW OFT HLP GTNG ARD HOME
#cca11hwoftdrv #C11 CA9 HOW OFTEN DROVE SP
#cca11hwoftott #C11 CA10 OFTN WENT ON OTH TRANSPR

4 Caregivers’ Features

che11enrgylmt #Energy often limited
cac11diffphy  #Caregiver physical difficulty helping
cac11exhaustd #Caregiver exhausted at night
cac11toomuch  #Care more than can handle
cac11uroutchg #Care routine then changes
cac11notime   #No time for self (-8, -7, -6)
cac11diffemlv #Caregiver emotional difficulty (-1, 1-5)
cpp11hlpkptgo #Kept from going out (-6, -1, 1, 2)
che11health   #General health (-8, 1-5)
che11sleepint #Interrupted sleep (-8, -7, -6, 1-5)
op11numhrsday #Number of hours per day help (-7, -1, 1-6)
op11numdaysmn #Number of days help per month (-7, -1, 1-6)

#Sociodemographic features (persons living with dementia and caregiver)
table(dementia1$cac11notime)

op11leveledu  #Caregiver education (NA -1, 1-5)
cac11diffinc  #Caregiver financial difficulties (-8, -7, -6; binary 1, 2)
ew11progneed1 #Persons living with dementia received food stamps (-8, -7; binary)
ew11finhlpfam #Persons living with dementia financial help from family (-8, -7; binary)
mc11havregdoc #Persons living with dementia have a regular doctor (1, 2; binary)
hc11hosptstay #Persons living with dementia hospital stay in last 12 months (1, 2)
hc11hosovrnht #Persons living with dementia number of hospital stays (-7, -1, 1-6 times)

Outcomes: Caregivers’ anxiety and depressive symptoms were each measured with two questions. Anxiety was measured using the Generalized Anxiety Disorder-2 (GAD-2) Scale, which consists of two questions. Since the NHATS provided GAD-2 data, this study used it to measure anxiety levels among caregivers. Each item on the scale is rated on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), resulting in a total score between 0 and 6. Higher scores correspond to greater anxiety, with a total GAD-2 score of 3 or more indicating anxiety.

Caregivers’ depression was evaluated using the Patient Health Questionnaire-2 (PHQ-2) Scale, which the NHATS also provided. Each item on the scale is rated on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), resulting in a total score between 0 and 6, with higher scores indicating more severe depression. The scale’s authors identified a score of 3 as the optimal cutpoint when screening for depression: if the score is 3 or greater, major depressive disorder is likely.

# Sum the two questions for GAD2
dementia1$total_gad2 <- dementia1$che11fltnervs + dementia1$che11fltworry
# Recode the combined variable using a cut-off of 3
dementia1$gad2_cg_cat <- ifelse(dementia1$total_gad2 < 3, 0, 1)
 table(dementia1$gad2_cg_cat)

  0   1 
279 249 
 summary(dementia1$gad2_cg_cat) #1 ~ anxiety
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.4716  1.0000  1.0000     251 
# Sum of the two questions for PHQ2 (che11fltltlin + che11fltdown) 
dementia1$total_phq2 <- dementia1$che11fltltlin+ dementia1$che11fltdown
#Recode the combined variable using a cut-off of 3
dementia1$phq2_cg_cat <- ifelse(dementia1$total_phq2 < 3, 0, 1)
 table(dementia1$phq2_cg_cat)

  0   1 
276 252 
 summary(dementia1$phq2_cg_cat) #1 ~ depression
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.4773  1.0000  1.0000     251 
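One caution about the sums above: they assume both items carry valid 0–3 values. NHATS/NSOC files use negative reserve codes (e.g., -8, -7, -1) for missing or inapplicable responses, and any such code left in place would silently distort a total score. A generic sketch of masking negatives before summing (illustrative values, not the study data):

```r
# Treat negative reserve codes as missing before summing two scale items
item1 <- c(0, 3, -8, 2)
item2 <- c(1, 2, 1, -7)
mask_neg <- function(x) ifelse(x < 0, NA, x)
total <- mask_neg(item1) + mask_neg(item2)
total  # rows containing a reserve code become NA rather than a spurious total
```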

Data analysis

For data analysis, we first conducted descriptive analyses, including means, standard deviations, ranges, and percentages, to summarize the dataset. To investigate how caregivers’ social strains and caregiver-level factors influence caregiver depression, we performed logistic regression analyses. Guided by the conceptual framework of this study, univariate logistic regression analyses were employed to identify caregivers’ social strains and caregiver-level factors significantly associated with caregiver anxiety and depression, controlling for care recipient-level factors. Variables with a p-value below 0.05 in the univariate analyses were included in the subsequent multivariate logistic regression model. The multivariate model was then constructed to determine which factors most strongly influenced caregiver anxiety and depression. All statistical analyses were conducted using R, with statistical significance set at a p-value of less than 0.05.
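The two-stage screening described above (univariate filtering at p < 0.05, then a multivariate model) can be sketched as follows. This is an illustrative sketch on simulated data with generic predictor names x1–x3, not the study’s actual variables:

```r
set.seed(2023)

# Simulated stand-in for the analysis data: a binary outcome driven by x1 only
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * dat$x1))

# Stage 1: univariate logistic regressions; retain predictors with p < 0.05
candidates <- c("x1", "x2", "x3")
keep <- candidates[sapply(candidates, function(v) {
  fit <- glm(reformulate(v, response = "y"), data = dat, family = binomial)
  summary(fit)$coefficients[v, "Pr(>|z|)"] < 0.05
})]

# Stage 2: multivariate logistic regression on the retained predictors
final_fit <- glm(reformulate(keep, response = "y"), data = dat, family = binomial)
summary(final_fit)
```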

4.1 Preparing the Analysis Subset

#choosing only one caregiver for each participant
final <- dementia1 |>
  group_by(spid) |>
  slice_head(n = 1) |>
  ungroup()

#total 563 caregivers 
#creating subset to work more effectively
library(dplyr)
selected_vars <- c(
  "spid","demclas", "cca11hwoftchs", "cca11hwoftshp", "cca11hwoftpc",
  "cca11hwofthom", "cca11hwoftdrv", "cca11hwoftott",
  "che11enrgylmt", "cac11diffphy", "cac11exhaustd",
  "cac11toomuch", "cac11uroutchg", "cac11notime",
  "cac11diffemlv", "cpp11hlpkptgo", "che11health", 
  "che11sleepint", 
   "ew11progneed1", "ew11finhlpfam", 
  "mc11havregdoc", "hc11hosptstay", "hc11hosovrnht", 
  "cac11diffinc", "pa11hlkepfvst", "pa11hlkpfrclb", "pa11hlkpgoenj",
  "pa11hlkpfrwrk", "pa11hlkpfrvol", "pa11prcranoth",
  "race_recode", "gender_recode", "edu_recode", "chd11dage",
  "martstat_recode", "phq2_cg_cat", "gad2_cg_cat"
)

dementia_subset <- dementia1 |> select(all_of(selected_vars))
head(dementia_subset)
# A tibble: 6 × 37
      spid demclas cca11hwoftchs    cca11hwoftshp    cca11hwoftpc  cca11hwofthom
     <dbl>   <dbl> <dbl+lbl>        <dbl+lbl>        <dbl+lbl>     <dbl+lbl>    
1 10000036       1 NA               NA               NA            NA           
2 10000041       2  1 [1 EVERY DAY]  1 [1 EVERY DAY]  1 [1 EVERY …  1 [1 EVERY …
3 10000041       2  5 [5 NEVER]      4 [4 RARELY]     5 [5 NEVER]   3 [3 SOME D…
4 10000051       1  1 [1 EVERY DAY]  3 [3 SOME DAYS]  2 [2 MOST D…  3 [3 SOME D…
5 10000064       1  2 [2 MOST DAYS]  5 [5 NEVER]      3 [3 SOME D…  3 [3 SOME D…
6 10000064       1  1 [1 EVERY DAY]  1 [1 EVERY DAY]  5 [5 NEVER]   3 [3 SOME D…
# ℹ 31 more variables: cca11hwoftdrv <dbl+lbl>, cca11hwoftott <dbl+lbl>,
#   che11enrgylmt <dbl+lbl>, cac11diffphy <dbl+lbl>, cac11exhaustd <dbl+lbl>,
#   cac11toomuch <dbl+lbl>, cac11uroutchg <dbl+lbl>, cac11notime <dbl+lbl>,
#   cac11diffemlv <dbl+lbl>, cpp11hlpkptgo <dbl+lbl>, che11health <dbl+lbl>,
#   che11sleepint <dbl+lbl>, ew11progneed1 <dbl+lbl>, ew11finhlpfam <dbl+lbl>,
#   mc11havregdoc <dbl+lbl>, hc11hosptstay <dbl+lbl>, hc11hosovrnht <dbl+lbl>,
#   cac11diffinc <dbl+lbl>, pa11hlkepfvst <dbl+lbl>, pa11hlkpfrclb <dbl+lbl>, …
# Select only numeric columns
numeric_dementia1 <- dementia_subset |> select(where(is.numeric))

# Compute the correlation matrix
cor_matrix <- cor(numeric_dementia1, method = "kendall", use = "pairwise.complete.obs")
# Filter correlations above a threshold
high_corr <- which(abs(cor_matrix) > 0.8 & abs(cor_matrix) < 1, arr.ind = TRUE)
high_corr_pairs <- data.frame(
  Var1 = rownames(cor_matrix)[high_corr[, 1]],
  Var2 = colnames(cor_matrix)[high_corr[, 2]],
  Correlation = cor_matrix[high_corr]
)
print(high_corr_pairs)
           Var1          Var2 Correlation
1 cca11hwoftdrv cca11hwoftdrv   1.0000000
2 cca11hwoftott cca11hwoftott   1.0000000
3 hc11hosptstay hc11hosptstay   1.0000000
4 hc11hosovrnht hc11hosptstay  -0.9391286
5 hc11hosptstay hc11hosovrnht  -0.9391286
6 pa11prcranoth pa11prcranoth   1.0000000
7 gender_recode gender_recode   1.0000000
# Load pheatmap (install with install.packages("pheatmap") if needed)
library(pheatmap)

# Create the heatmap
pheatmap(cor_matrix, color = colorRampPalette(c("blue", "white", "red"))(50))

Since my outcomes are GAD-2 and PHQ-2, I am most focused on the variables within the highlighted block of the heatmap: che11enrgylmt, che11health, cac11diffemlv, hc11hosovrnht, martstat_recode, race_recode, mc11havregdoc, gender_recode, cca11hwoftott, ia11totinc, ew11finhlpfam, and ew11progneed1.

Correlation matrices were presented as heatmaps for both rounds (5 and 7) to visually assess the data and evaluate the independence of variables (Supplementary Figure 2a and b). Because most items are ordinal, Kendall’s rank correlation (tau) was computed, as in the code above, with values ranging from −1 to +1: a coefficient of 0 indicates no association, while negative and positive values represent negative and positive associations, respectively. A p-value of < .05 was used to define statistical significance. I then selected the most representative variable among highly correlated items to reduce conceptual redundancy. The negative values in the dataset were not addressed in this step.
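One way to operationalize the redundancy-reduction step is to drop one member of each highly correlated pair. A minimal base-R sketch on simulated data follows; the variables `a`, `b`, and `c` are illustrative, not study variables.

```r
# Sketch: for each pair with |r| > 0.8, drop the second member of the pair.
set.seed(1)
X <- data.frame(a = rnorm(100))
X$b <- X$a + rnorm(100, sd = 0.1)  # b nearly duplicates a
X$c <- rnorm(100)

cm <- cor(X)
# Upper-triangle pairs exceeding the threshold (diagonal excluded)
pairs <- which(abs(cm) > 0.8 & row(cm) < col(cm), arr.ind = TRUE)
drop_vars <- unique(colnames(cm)[pairs[, "col"]])

X_reduced <- X[, setdiff(names(X), drop_vars), drop = FALSE]
names(X_reduced)  # "a" "c"
```

The caret package’s `findCorrelation()` implements a similar heuristic with a tunable cutoff, if a packaged solution is preferred.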

4.1.1 Feature selection

The original dataset includes numerous negative (reserved-code) and NA values, and the sample size is small, necessitating preprocessing before feature selection. Negative values were first converted to NA. To preserve the small sample rather than dropping incomplete cases, NA values were then imputed according to feature type: continuous variables were imputed with the median of the respective column, and categorical features with the most frequent category level.


# converting negative values to NA
dementia_subset[dementia_subset < 0] <- NA

# continuous variables: impute NA with the column median
continuous_vars <- sapply(dementia_subset, is.numeric)
dementia_subset[continuous_vars] <- lapply(dementia_subset[continuous_vars], function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))

# categorical variables: impute NA with the most frequent level
categorical_vars <- sapply(dementia_subset, is.factor)
dementia_subset[categorical_vars] <- lapply(dementia_subset[categorical_vars], function(x) {
  mode_level <- names(sort(table(x), decreasing = TRUE))[1]
  x[is.na(x)] <- mode_level
  x
})
#final cleaning for dataset
final_dementia <- dementia_subset |>
  group_by(spid) |>
  slice_head(n = 1) |>
  ungroup()

#creating new dataset for each outcome
anxiety <- subset(
  final_dementia,
  select = c(gad2_cg_cat, cca11hwoftchs, cca11hwoftshp, cca11hwoftott, che11enrgylmt, 
             cac11diffphy, cac11exhaustd, cac11diffemlv, che11health, che11sleepint, ew11progneed1, race_recode, gender_recode, edu_recode, 
             chd11dage, martstat_recode)
)

depression <- subset(
  final_dementia,
  select = c(phq2_cg_cat, cca11hwoftchs, cca11hwoftshp, cca11hwoftott, che11enrgylmt, 
             cac11diffphy, cac11exhaustd, cac11diffemlv, che11health, che11sleepint, ew11progneed1, race_recode, gender_recode, edu_recode, 
             chd11dage, martstat_recode)
)

4.2 Results

4.2.1 Characteristics of Participants

library(table1)

Attaching package: 'table1'
The following objects are masked from 'package:base':

    units, units<-
#labeling

# Create the table with the correct handling of categorical and continuous variables
table1(~ race_recode + gender_recode + edu_recode + martstat_recode + chd11dage | demclas, 
       data = final_dementia,
       render.categorical = function(x, name, ...) {
         # Count the categories and calculate percentage
         count <- table(x)
         sprintf("%d (%0.1f%%)", count, count / sum(count) * 100)
       },
       render.continuous = function(x, name, ...) {
         # Calculate mean and standard deviation for continuous variable
         sprintf("%.1f (± %.1f)", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
       })
Warning in table1.formula(~race_recode + gender_recode + edu_recode +
martstat_recode + : Terms to the right of '|' in formula 'x' define table
columns and are expected to be factors with meaningful labels.
                   1              2              Overall
                   (N=331)        (N=232)        (N=563)
race_recode        0.9 (± 0.8)    0.8 (± 0.5)    0.8 (± 0.7)
gender_recode      0.8 (± 0.4)    0.9 (± 0.3)    0.8 (± 0.4)
edu_recode         1.8 (± 0.6)    1.8 (± 0.5)    1.8 (± 0.5)
martstat_recode    0.8 (± 0.4)    0.9 (± 0.3)    0.8 (± 0.4)
chd11dage          65.2 (± 9.2)   64.6 (± 7.3)   65.0 (± 8.5)

4.2.1.1 Graph

# Dementia class analyses

#labeling variables.
final1 <- final_dementia |> 
  mutate(
    race_recode = factor(race_recode, levels = c("0", "1", "2", "3"),
                         labels = c("White, non-Hispanic", "Black, non-Hispanic", "Other", "Hispanic")),
    gender_recode = factor(gender_recode, levels = c("0", "1"),
                           labels = c("Male", "Female")),
    edu_recode = factor(edu_recode, levels = c("0", "1", "2" ),
                        labels = c("Below high school", "Some college", "College and beyond")),
    martstat_recode = factor(martstat_recode, levels = c("0", "1"),
                             labels = c("Not Married", " Married"))
  )

# 1. Bar Plot for Categorical Variables (e.g., Race, Gender, Education, Marital Status)
# Race
ggplot(final1, aes(x = race_recode, fill = factor(demclas))) + 
  geom_bar() +  # default bar plot based on counts
  labs(title = "Race Distribution by Dementia Class",
       x = "Race",
       y = "Count",
       fill = "Dementia class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

# Gender - Count plot
ggplot(final1, aes(x = gender_recode, fill = factor(demclas))) + 
  geom_bar() +  # default bar plot based on counts
  labs(title = "Gender Distribution by Dementia Class",
       x = "Gender",
       y = "Count",
       fill = "Dementia class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

# Education - Count plot
ggplot(final1, aes(x = edu_recode, fill = factor(demclas))) + 
  geom_bar() +  # default bar plot based on counts
  labs(title = "Education Distribution by Dementia Class",
       x = "Education",
       y = "Count",
       fill = "Dementia class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

# Marital Status - Count plot
ggplot(final1, aes(x = martstat_recode, fill = factor(demclas))) + 
  geom_bar() +  # default bar plot based on counts
  labs(title = "Marital Status Distribution by Dementia Class",
       x = "Marital Status",
       y = "Count",
       fill = "Dementia class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

# 2. Boxplot for Continuous Variable (Age)
ggplot(final1, aes(x = factor(demclas), y = chd11dage, fill = factor(demclas))) +
  geom_boxplot() +
  labs(title = "Age Distribution by Dementia Class",
       x = "Dementia class",
       y = "Age",
       fill = "Dementia class") +
  scale_fill_brewer(palette = "Set3") + 
  theme_minimal()

This exploratory study employs multiple machine learning techniques—correlation matrix analysis, generalized linear models (GLM), and random forest (RF)—to identify key predictors of caregiver depression and anxiety. This multipronged approach suits the diverse types of data in this study. Because machine learning methods are inductive, they support hypothesis generation and allow systematic feature reduction by excluding variables deemed unimportant across multiple methods. This refines the feature set, enhancing the interpretability and predictive accuracy of the models.
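The cross-method feature reduction can be sketched as an intersection of the features each method flags; the two vectors below are illustrative, not the study’s actual rankings.

```r
# Features significant in the logistic regression (illustrative example)
glm_sig <- c("cac11exhaustd", "race_recode", "gender_recode", "edu_recode")
# Top features by random forest Mean Decrease Gini (illustrative example)
rf_top  <- c("cac11exhaustd", "chd11dage", "che11sleepint", "race_recode")

# Consensus set: variables deemed important by both methods
consensus <- intersect(glm_sig, rf_top)
consensus  # "cac11exhaustd" "race_recode"
```

Variables outside the consensus set are candidates for exclusion in subsequent models.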

4.2.2 Logistic regression

# Logistic regression for outcome gad2_cg_cat
model_gad2 <- glm(
  gad2_cg_cat ~ cca11hwoftchs + cca11hwoftshp + cca11hwoftott + che11enrgylmt + 
    cac11diffphy + cac11exhaustd + cac11diffemlv + che11health + che11sleepint + 
    ew11progneed1 + race_recode + gender_recode + edu_recode + chd11dage + 
    martstat_recode, 
  data = anxiety, 
  family = binomial(link = "logit")
)
summary(model_gad2)

Call:
glm(formula = gad2_cg_cat ~ cca11hwoftchs + cca11hwoftshp + cca11hwoftott + 
    che11enrgylmt + cac11diffphy + cac11exhaustd + cac11diffemlv + 
    che11health + che11sleepint + ew11progneed1 + race_recode + 
    gender_recode + edu_recode + chd11dage + martstat_recode, 
    family = binomial(link = "logit"), data = anxiety)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      7.055252   2.225233   3.171  0.00152 ** 
cca11hwoftchs    0.198116   0.133486   1.484  0.13776    
cca11hwoftshp   -0.096310   0.145559  -0.662  0.50819    
cca11hwoftott    0.203160   0.220421   0.922  0.35669    
che11enrgylmt   -0.082552   0.254866  -0.324  0.74601    
cac11diffphy    -0.978454   0.358695  -2.728  0.00638 ** 
cac11exhaustd   -1.419467   0.215747  -6.579 4.73e-11 ***
cac11diffemlv   -0.015984   0.182646  -0.088  0.93026    
che11health      0.331630   0.161528   2.053  0.04006 *  
che11sleepint   -0.521428   0.173869  -2.999  0.00271 ** 
ew11progneed1    0.460358   0.413274   1.114  0.26531    
race_recode     -0.564930   0.181982  -3.104  0.00191 ** 
gender_recode   -0.771277   0.300930  -2.563  0.01038 *  
edu_recode      -0.630911   0.233963  -2.697  0.00700 ** 
chd11dage       -0.002079   0.013950  -0.149  0.88151    
martstat_recode -0.869216   0.339016  -2.564  0.01035 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 660.67  on 562  degrees of freedom
Residual deviance: 411.90  on 547  degrees of freedom
AIC: 443.9

Number of Fisher Scoring iterations: 5
# Logistic regression for outcome phq2_cg_cat
model_phq2 <- glm(
  phq2_cg_cat ~ cca11hwoftchs + cca11hwoftshp + cca11hwoftott + che11enrgylmt + 
    cac11diffphy + cac11exhaustd + cac11diffemlv + che11health + che11sleepint + 
    ew11progneed1 + race_recode + gender_recode + edu_recode + chd11dage + 
    martstat_recode, 
  data = depression, 
  family = binomial(link = "logit")
)
summary(model_phq2)

Call:
glm(formula = phq2_cg_cat ~ cca11hwoftchs + cca11hwoftshp + cca11hwoftott + 
    che11enrgylmt + cac11diffphy + cac11exhaustd + cac11diffemlv + 
    che11health + che11sleepint + ew11progneed1 + race_recode + 
    gender_recode + edu_recode + chd11dage + martstat_recode, 
    family = binomial(link = "logit"), data = depression)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      6.010701   2.121163   2.834 0.004602 ** 
cca11hwoftchs   -0.026695   0.131499  -0.203 0.839129    
cca11hwoftshp   -0.001903   0.143780  -0.013 0.989440    
cca11hwoftott    0.092992   0.205062   0.453 0.650202    
che11enrgylmt   -0.238207   0.242829  -0.981 0.326608    
cac11diffphy    -0.755139   0.347061  -2.176 0.029569 *  
cac11exhaustd   -1.244365   0.204329  -6.090 1.13e-09 ***
cac11diffemlv    0.138485   0.174980   0.791 0.428691    
che11health      0.338219   0.156592   2.160 0.030782 *  
che11sleepint   -0.294320   0.161730  -1.820 0.068787 .  
ew11progneed1    0.271095   0.380739   0.712 0.476452    
race_recode     -0.640039   0.176302  -3.630 0.000283 ***
gender_recode   -0.801365   0.289794  -2.765 0.005687 ** 
edu_recode      -0.725094   0.228080  -3.179 0.001477 ** 
chd11dage        0.002991   0.013403   0.223 0.823433    
martstat_recode -0.680916   0.330734  -2.059 0.039514 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 652.71  on 562  degrees of freedom
Residual deviance: 445.84  on 547  degrees of freedom
AIC: 477.84

Number of Fisher Scoring iterations: 5
# Calculate odds ratios and confidence intervals for gad2_cg_cat model
odds_ratios_gad2 <- exp(coef(model_gad2))
conf_intervals_gad2 <- exp(confint(model_gad2))
Waiting for profiling to be done...
# Combine into a tidy table for gad2_cg_cat
results_gad2 <- data.frame(
  Variable = names(odds_ratios_gad2),
  Odds_Ratio = odds_ratios_gad2,
  CI_Lower = conf_intervals_gad2[, 1],
  CI_Upper = conf_intervals_gad2[, 2]
)
print("Odds Ratios for gad2_cg_cat Model:")
[1] "Odds Ratios for gad2_cg_cat Model:"
print(results_gad2)
                       Variable   Odds_Ratio   CI_Lower     CI_Upper
(Intercept)         (Intercept) 1158.9291601 15.5110504 9.780599e+04
cca11hwoftchs     cca11hwoftchs    1.2191040  0.9381851 1.585724e+00
cca11hwoftshp     cca11hwoftshp    0.9081825  0.6829866 1.210961e+00
cca11hwoftott     cca11hwoftott    1.2252679  0.8096916 1.926016e+00
che11enrgylmt     che11enrgylmt    0.9207632  0.5554478 1.514254e+00
cac11diffphy       cac11diffphy    0.3758918  0.1852315 7.595119e-01
cac11exhaustd     cac11exhaustd    0.2418430  0.1564642 3.654413e-01
cac11diffemlv     cac11diffemlv    0.9841432  0.6884434 1.412463e+00
che11health         che11health    1.3932377  1.0204185 1.925540e+00
che11sleepint     che11sleepint    0.5936722  0.4193536 8.313417e-01
ew11progneed1     ew11progneed1    1.5846416  0.7227122 3.671437e+00
race_recode         race_recode    0.5684000  0.3960653 8.100796e-01
gender_recode     gender_recode    0.4624220  0.2563228 8.360574e-01
edu_recode           edu_recode    0.5321066  0.3356476 8.416502e-01
chd11dage             chd11dage    0.9979228  0.9707682 1.025558e+00
martstat_recode martstat_recode    0.4192802  0.2150644 8.152782e-01
# Calculate odds ratios and confidence intervals for phq2_cg_cat model
odds_ratios_phq2 <- exp(coef(model_phq2))
conf_intervals_phq2 <- exp(confint(model_phq2))
Waiting for profiling to be done...
# Combine into a tidy table for phq2_cg_cat
results_phq2 <- data.frame(
  Variable = names(odds_ratios_phq2),
  Odds_Ratio = odds_ratios_phq2,
  CI_Lower = conf_intervals_phq2[, 1],
  CI_Upper = conf_intervals_phq2[, 2]
)
print("Odds Ratios for phq2_cg_cat Model:")
[1] "Odds Ratios for phq2_cg_cat Model:"
print(results_phq2)
                       Variable  Odds_Ratio  CI_Lower     CI_Upper
(Intercept)         (Intercept) 407.7690755 6.5860233 2.755568e+04
cca11hwoftchs     cca11hwoftchs   0.9736579 0.7502830 1.257981e+00
cca11hwoftshp     cca11hwoftshp   0.9980988 0.7542048 1.327802e+00
cca11hwoftott     cca11hwoftott   1.0974528 0.7423853 1.666580e+00
che11enrgylmt     che11enrgylmt   0.7880393 0.4846498 1.261420e+00
cac11diffphy       cac11diffphy   0.4699451 0.2378984 9.313639e-01
cac11exhaustd     cac11exhaustd   0.2881238 0.1910987 4.267062e-01
cac11diffemlv     cac11diffemlv   1.1485325 0.8172844 1.627034e+00
che11health         che11health   1.4024482 1.0377242 1.920274e+00
che11sleepint     che11sleepint   0.7450383 0.5409786 1.021993e+00
ew11progneed1     ew11progneed1   1.3113991 0.6348716 2.838612e+00
race_recode         race_recode   0.5272717 0.3705401 7.411819e-01
gender_recode     gender_recode   0.4487161 0.2541691 7.934679e-01
edu_recode           edu_recode   0.4842792 0.3086426 7.561883e-01
chd11dage             chd11dage   1.0029951 0.9768238 1.029731e+00
martstat_recode martstat_recode   0.5061534 0.2643094 9.698317e-01
library(broom)
tidy_gad2 <- tidy(model_gad2, exponentiate = TRUE, conf.int = TRUE)
tidy_phq2 <- tidy(model_phq2, exponentiate = TRUE, conf.int = TRUE)

print("Tidy Results for gad2_cg_cat Model:")
[1] "Tidy Results for gad2_cg_cat Model:"
print(tidy_gad2)
# A tibble: 16 × 7
   term            estimate std.error statistic  p.value conf.low conf.high
   <chr>              <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
 1 (Intercept)     1159.       2.23      3.17   1.52e- 3   15.5   97806.   
 2 cca11hwoftchs      1.22     0.133     1.48   1.38e- 1    0.938     1.59 
 3 cca11hwoftshp      0.908    0.146    -0.662  5.08e- 1    0.683     1.21 
 4 cca11hwoftott      1.23     0.220     0.922  3.57e- 1    0.810     1.93 
 5 che11enrgylmt      0.921    0.255    -0.324  7.46e- 1    0.555     1.51 
 6 cac11diffphy       0.376    0.359    -2.73   6.38e- 3    0.185     0.760
 7 cac11exhaustd      0.242    0.216    -6.58   4.73e-11    0.156     0.365
 8 cac11diffemlv      0.984    0.183    -0.0875 9.30e- 1    0.688     1.41 
 9 che11health        1.39     0.162     2.05   4.01e- 2    1.02      1.93 
10 che11sleepint      0.594    0.174    -3.00   2.71e- 3    0.419     0.831
11 ew11progneed1      1.58     0.413     1.11   2.65e- 1    0.723     3.67 
12 race_recode        0.568    0.182    -3.10   1.91e- 3    0.396     0.810
13 gender_recode      0.462    0.301    -2.56   1.04e- 2    0.256     0.836
14 edu_recode         0.532    0.234    -2.70   7.00e- 3    0.336     0.842
15 chd11dage          0.998    0.0140   -0.149  8.82e- 1    0.971     1.03 
16 martstat_recode    0.419    0.339    -2.56   1.03e- 2    0.215     0.815
print("Tidy Results for phq2_cg_cat Model:")
[1] "Tidy Results for phq2_cg_cat Model:"
print(tidy_phq2)
# A tibble: 16 × 7
   term            estimate std.error statistic       p.value conf.low conf.high
   <chr>              <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
 1 (Intercept)      408.       2.12      2.83   0.00460          6.59  27556.   
 2 cca11hwoftchs      0.974    0.131    -0.203  0.839            0.750     1.26 
 3 cca11hwoftshp      0.998    0.144    -0.0132 0.989            0.754     1.33 
 4 cca11hwoftott      1.10     0.205     0.453  0.650            0.742     1.67 
 5 che11enrgylmt      0.788    0.243    -0.981  0.327            0.485     1.26 
 6 cac11diffphy       0.470    0.347    -2.18   0.0296           0.238     0.931
 7 cac11exhaustd      0.288    0.204    -6.09   0.00000000113    0.191     0.427
 8 cac11diffemlv      1.15     0.175     0.791  0.429            0.817     1.63 
 9 che11health        1.40     0.157     2.16   0.0308           1.04      1.92 
10 che11sleepint      0.745    0.162    -1.82   0.0688           0.541     1.02 
11 ew11progneed1      1.31     0.381     0.712  0.476            0.635     2.84 
12 race_recode        0.527    0.176    -3.63   0.000283         0.371     0.741
13 gender_recode      0.449    0.290    -2.77   0.00569          0.254     0.793
14 edu_recode         0.484    0.228    -3.18   0.00148          0.309     0.756
15 chd11dage          1.00     0.0134    0.223  0.823            0.977     1.03 
16 martstat_recode    0.506    0.331    -2.06   0.0395           0.264     0.970

4.2.3 GLM

4.2.3.1 Anxiety

#creating training/ testing data
anxiety_split <- initial_split(anxiety, 
                            prop = 0.80)
anxiety_split #<450/113/563>
<Training/Testing/Total>
<450/113/563>
anxiety_train <- training(anxiety_split)
anxiety_test <- testing(anxiety_split)

#glm
set.seed(123)
lr_class_spec<-logistic_reg()|>
  set_engine("glm")

anxiety_train$gad2_cg_cat <- as.factor(anxiety_train$gad2_cg_cat)
anxiety_test$gad2_cg_cat <- as.factor(anxiety_test$gad2_cg_cat)

lr_class_fit <- lr_class_spec |>
  fit(gad2_cg_cat ~ ., data = anxiety_train)

## top variables
significant_predictors <- tidy(lr_class_fit$fit) |>
  filter(p.value < 0.05)
print(significant_predictors)
# A tibble: 8 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      6.19      2.53       2.44 1.45e- 2
2 cca11hwoftshp   -0.362     0.175     -2.06 3.93e- 2
3 cca11hwoftott    0.579     0.269      2.15 3.15e- 2
4 cac11exhaustd   -1.72      0.255     -6.73 1.72e-11
5 che11health      0.530     0.193      2.75 6.04e- 3
6 race_recode     -0.526     0.204     -2.58 9.84e- 3
7 gender_recode   -1.01      0.345     -2.94 3.33e- 3
8 edu_recode      -0.600     0.268     -2.24 2.51e- 2
### 
anxiety_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(gad2_cg_cat ~ .)


  # Fit the workflow to the test data
anxiety_glm_fit <- anxiety_glm_wf |> 
  fit(data = anxiety_test)

# Generate predictions with probabilities
anxiety_glm_predicted <- predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob")

anxiety_glm_predicted 
# A tibble: 113 × 2
   .pred_0   .pred_1
     <dbl>     <dbl>
 1   1.00  0.000477 
 2   0.998 0.00223  
 3   0.998 0.00223  
 4   0.135 0.865    
 5   1.00  0.0000518
 6   0.998 0.00223  
 7   0.998 0.00223  
 8   0.998 0.00223  
 9   0.670 0.330    
10   0.158 0.842    
# ℹ 103 more rows
# Combine into a single data frame
anxiety_glm_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,  # Actual values of the outcome variable
  predict(anxiety_glm_fit, new_data = anxiety_test),  # Predicted class labels
  predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob")  # Predicted probabilities
)

print(anxiety_glm_pred_values)
# A tibble: 113 × 4
   truth .pred_class .pred_0   .pred_1
   <fct> <fct>         <dbl>     <dbl>
 1 0     0             1.00  0.000477 
 2 0     0             0.998 0.00223  
 3 0     0             0.998 0.00223  
 4 1     1             0.135 0.865    
 5 0     0             1.00  0.0000518
 6 0     0             0.998 0.00223  
 7 0     0             0.998 0.00223  
 8 0     0             0.998 0.00223  
 9 1     0             0.670 0.330    
10 1     1             0.158 0.842    
# ℹ 103 more rows
# 20-fold cross-validation
anxiety_folds<-vfold_cv(anxiety_train, v=20)

glm_workflow<-workflow()|>
  add_model(lr_class_spec)|>
  add_formula(gad2_cg_cat ~ .)

glm_fit_cv<-glm_workflow|>
  fit_resamples(anxiety_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(glm_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.818    20  0.0199 Preprocessor1_Model1
2 brier_class binary     0.130    20  0.0134 Preprocessor1_Model1
3 roc_auc     binary     0.876    20  0.0225 Preprocessor1_Model1
# cross-validated ROC-AUC 0.876

anxiety_glm_cv_preds<-collect_predictions(glm_fit_cv)
anxiety_glm_cv_preds |>
  group_by(id) |>
  roc_curve(truth = gad2_cg_cat, .pred_0) |>
  autoplot()

#Prediction on the test data
anxiety.lr.pred.values.test <-  bind_cols(
  truth = anxiety_test$gad2_cg_cat,
  predict(lr_class_fit, anxiety_test),
  predict(lr_class_fit, anxiety_test, type = "prob")
)
anxiety.lr.pred.values.test
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     1             0.350 0.650  
 2 0     0             0.931 0.0690 
 3 0     0             0.931 0.0690 
 4 1     0             0.995 0.00531
 5 0     0             0.940 0.0604 
 6 0     0             0.931 0.0690 
 7 0     0             0.931 0.0690 
 8 0     0             0.931 0.0690 
 9 1     1             0.404 0.596  
10 1     1             0.195 0.805  
# ℹ 103 more rows
#Plot of ROC curve of prediction on test results
autoplot(roc_curve(anxiety.lr.pred.values.test, 
                   truth, 
                   .pred_0))

#Metrics of prediction on test data
metrics(anxiety.lr.pred.values.test, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.796
2 kap         binary         0.447
3 mn_log_loss binary         0.442
4 roc_auc     binary         0.866
#roc_auc 0.866 on the test data

4.2.3.2 Depression

#creating training/ testing data
depression_split <- initial_split(depression, 
                            prop = 0.80)
depression_split #<450/113/563>
<Training/Testing/Total>
<450/113/563>
depression_train <- training(depression_split)
depression_test <- testing(depression_split)

#glm
set.seed(123)
lr_class_spec<-logistic_reg()|>
  set_engine("glm")

depression_train$phq2_cg_cat <- as.factor(depression_train$phq2_cg_cat)
depression_test$phq2_cg_cat <- as.factor(depression_test$phq2_cg_cat)

lr_class_fit <- lr_class_spec |>
  fit(phq2_cg_cat ~ ., data = depression_train)

## top variables
significant_predictors_depression <- tidy(lr_class_fit$fit) |>
  filter(p.value < 0.05)
print(significant_predictors_depression)
# A tibble: 4 × 5
  term          estimate std.error statistic     p.value
  <chr>            <dbl>     <dbl>     <dbl>       <dbl>
1 cac11exhaustd   -1.22      0.237     -5.16 0.000000252
2 race_recode     -0.601     0.196     -3.07 0.00213    
3 gender_recode   -0.658     0.335     -1.97 0.0494     
4 edu_recode      -0.632     0.246     -2.57 0.0101     
### 
depression_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(phq2_cg_cat ~ .)

# Fit the workflow to the test data
depression_glm_fit <- depression_glm_wf |> 
  fit(data = depression_test)

# Generate predictions with probabilities
depression_glm_predicted <- predict(depression_glm_fit, new_data = depression_test, type = "prob")

depression_glm_predicted 
# A tibble: 113 × 2
   .pred_0 .pred_1
     <dbl>   <dbl>
 1  0.0864  0.914 
 2  0.0417  0.958 
 3  0.958   0.0417
 4  0.182   0.818 
 5  0.947   0.0526
 6  0.947   0.0526
 7  0.947   0.0526
 8  0.0141  0.986 
 9  0.947   0.0526
10  0.947   0.0526
# ℹ 103 more rows
# Combine into a single data frame
depression_glm_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat,  # Actual values of the outcome variable
  predict(depression_glm_fit, new_data = depression_test),  # Predicted class labels
  predict(depression_glm_fit, new_data = depression_test, type = "prob")  # Predicted probabilities
)

print(depression_glm_pred_values)
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 1     1            0.0864  0.914 
 2 1     1            0.0417  0.958 
 3 0     0            0.958   0.0417
 4 1     1            0.182   0.818 
 5 0     0            0.947   0.0526
 6 0     0            0.947   0.0526
 7 0     0            0.947   0.0526
 8 1     1            0.0141  0.986 
 9 0     0            0.947   0.0526
10 0     0            0.947   0.0526
# ℹ 103 more rows
# 20-fold cross-validation
depression_folds<-vfold_cv(depression_train, v=20)

depression_glm_wf <- workflow()|>
  add_model(lr_class_spec)|>
  add_formula(phq2_cg_cat ~ .)

depression_glm_fit_cv<-depression_glm_wf |>
  fit_resamples(depression_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(depression_glm_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.789    20  0.0225 Preprocessor1_Model1
2 brier_class binary     0.144    20  0.0109 Preprocessor1_Model1
3 roc_auc     binary     0.829    20  0.0279 Preprocessor1_Model1
# cross-validated ROC-AUC 0.829

depression_glm_cv_preds<-collect_predictions(depression_glm_fit_cv)
depression_glm_cv_preds |>
  group_by(id) |>
  roc_curve(truth = phq2_cg_cat, .pred_0) |>
  autoplot()

#Prediction on the test data
depression.lr.pred.values.test <-  bind_cols(
  truth = depression_test$phq2_cg_cat,
  predict(lr_class_fit, depression_test),
  predict(lr_class_fit, depression_test, type = "prob")
)
depression.lr.pred.values.test
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 1     0             0.805  0.195 
 2 1     1             0.280  0.720 
 3 0     0             0.841  0.159 
 4 1     0             0.846  0.154 
 5 0     0             0.918  0.0817
 6 0     0             0.918  0.0817
 7 0     0             0.918  0.0817
 8 1     1             0.234  0.766 
 9 0     0             0.918  0.0817
10 0     0             0.918  0.0817
# ℹ 103 more rows
#Plot of ROC curve of prediction on test results
autoplot(roc_curve(depression.lr.pred.values.test, 
                   truth, 
                   .pred_0))

#Metrics of prediction on test data
metrics(depression.lr.pred.values.test, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.796
2 kap         binary         0.485
3 mn_log_loss binary         0.407
4 roc_auc     binary         0.884
#roc_auc 0.8837

4.2.4 Random Forest (RF)

I specified a random forest model with 1,000 trees and a minimum node size of 5, using the randomForest engine for classification and enabling variable importance. I trained the model on the anxiety_train dataset and visualized the top predictors based on Mean Decrease Gini.

For model validation, I performed 20-fold cross-validation, integrating the random forest model into a workflow. The model achieved a strong ROC-AUC score of 0.91, demonstrating excellent classification performance.

4.2.4.1 Anxiety

library(tune)
library(parsnip)
library(recipes)
library(rsample)
library(workflows)
library(vip)       # for variable-importance plots

rf_spec<-rand_forest(trees=1000, min_n=5)|>
  set_engine("randomForest", importance=TRUE)|>
  set_mode("classification")

rf_fit<-rf_spec|>
  fit(gad2_cg_cat ~ ., data=anxiety_train)

## top variables
rf_fit|>
  extract_fit_engine()|>
  vip()

rf_fit|>
  extract_fit_engine()|>
  importance()|>
  as.data.frame()|>
  arrange(desc(MeanDecreaseGini))
                        0          1 MeanDecreaseAccuracy MeanDecreaseGini
cac11exhaustd   23.171355 33.1263850            32.248333        28.361197
chd11dage       19.340410 12.1197165            22.120615        20.340709
che11sleepint   17.422506  9.5534887            19.049225        14.820783
che11health     20.936824  4.1216617            20.820806        12.342446
race_recode      9.539030 13.6110142            13.485208        10.535703
che11enrgylmt   21.745856  7.7147062            21.696816         8.833165
cac11diffphy    18.428041 11.2226957            20.002653         8.764675
cac11diffemlv   13.795726 10.5474742            17.193358         8.658396
cca11hwoftshp   13.200983  1.3003715            12.319740         7.588515
cca11hwoftchs    8.367029 -0.5633838             7.530039         7.395683
edu_recode       8.970853 -0.7405831             7.983682         5.537575
gender_recode    7.614697  1.7871179             7.512533         3.357692
martstat_recode  6.308413  2.5878496             7.133171         3.149396
cca11hwoftott   11.229287 -4.9477327             4.992850         3.026739
ew11progneed1    1.740454  0.8349129             1.808328         2.031407
#20fold CV- rf
anxiety_folds<-vfold_cv(anxiety_train, v=20)
rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(gad2_cg_cat ~ .)

rf_fit_cv <- rf_workflow |>
  fit_resamples(anxiety_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(rf_fit_cv) 
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.831    20 0.0187  Preprocessor1_Model1
2 brier_class binary     0.109    20 0.00839 Preprocessor1_Model1
3 roc_auc     binary     0.909    20 0.0143  Preprocessor1_Model1
#roc_auc = 0.909
anxiety_rf_cv_preds<-collect_predictions(rf_fit_cv)
anxiety_rf_cv_preds|>
  group_by(id)|>
  roc_auc(truth = gad2_cg_cat, .pred_0)
# A tibble: 20 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.858
 2 Fold02 roc_auc binary         0.974
 3 Fold03 roc_auc binary         0.914
 4 Fold04 roc_auc binary         0.938
 5 Fold05 roc_auc binary         0.974
 6 Fold06 roc_auc binary         0.9  
 7 Fold07 roc_auc binary         0.967
 8 Fold08 roc_auc binary         0.969
 9 Fold09 roc_auc binary         0.875
10 Fold10 roc_auc binary         0.920
11 Fold11 roc_auc binary         0.941
12 Fold12 roc_auc binary         0.861
13 Fold13 roc_auc binary         0.859
14 Fold14 roc_auc binary         0.893
15 Fold15 roc_auc binary         0.964
16 Fold16 roc_auc binary         0.824
17 Fold17 roc_auc binary         0.988
18 Fold18 roc_auc binary         0.906
19 Fold19 roc_auc binary         0.723
20 Fold20 roc_auc binary         0.938
# Plot ROC curve
anxiety_rf_cv_preds |>
  group_by(id)|>
  roc_curve(truth = gad2_cg_cat, .pred_0) |>
  autoplot()

# Fit the random forest model on the full training data
anxiety_rf_fit <- rf_spec |>
  fit(gad2_cg_cat ~ ., data = anxiety_train)

anxiety_rf_fit
parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5,      x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 18.44%
Confusion matrix:
    0  1 class.error
0 276 48   0.1481481
1  35 91   0.2777778
#testing
anxiety_rf_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,  # Actual values of the outcome variable
  predict(anxiety_rf_fit, new_data = anxiety_test),  # Predicted class labels
  predict(anxiety_rf_fit, new_data = anxiety_test, type = "prob")  # Predicted probabilities
)
roc_auc(anxiety_rf_pred_values,
        truth, 
        .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary           0.9
#roc_auc = 0.90
autoplot(roc_curve(anxiety_rf_pred_values, 
                   truth, 
                   .pred_0))

I generated cross-validation predictions and calculated the ROC-AUC for each fold, then plotted the ROC curves for the cross-validation results. Afterward, I fitted the random forest model to the full training data (anxiety_train) and predicted class labels and probabilities on the test data, achieving a ROC-AUC of 0.90. Finally, I plotted the ROC curve for the test-set predictions.
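Beyond ROC-AUC, yardstick offers class-level metrics such as the confusion matrix and sensitivity, which can complement the probability-based evaluation above. A minimal self-contained sketch on hypothetical toy predictions shaped like anxiety_rf_pred_values (the truth/prediction values here are made up for illustration):

```r
library(yardstick)
library(tibble)

# Hypothetical toy predictions in the same shape as anxiety_rf_pred_values
toy_preds <- tibble(
  truth       = factor(c(0, 0, 1, 1, 0, 1), levels = c(0, 1)),
  .pred_class = factor(c(0, 0, 1, 0, 0, 1), levels = c(0, 1))
)

conf_mat(toy_preds, truth, .pred_class)  # 2x2 table of truth vs. prediction
accuracy(toy_preds, truth, .pred_class)  # proportion correct (5/6 here)
sens(toy_preds, truth, .pred_class)      # sensitivity for the first level, "0"
```

The same calls would apply directly to the real test-set predictions, since those tibbles already contain truth and .pred_class columns.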

anxiety_rf_fit |>
  extract_fit_engine() |>
  importance()
                        0            1 MeanDecreaseAccuracy MeanDecreaseGini
cca11hwoftchs    7.396187  0.209105248             6.788656         7.606053
cca11hwoftshp   11.365262  1.036279295            10.811638         7.470802
cca11hwoftott   10.560545 -3.818954121             5.744920         2.947853
che11enrgylmt   22.237587  9.517067992            22.266853         8.746744
cac11diffphy    17.240319  9.881505012            18.912879         8.087237
cac11exhaustd   24.221128 33.437068729            33.067182        29.504680
cac11diffemlv   14.548896 11.054291348            17.567550         8.515510
che11health     21.906843  6.392345010            21.823328        12.355135
che11sleepint   17.089350  9.038386034            18.679314        14.501086
ew11progneed1    3.624847 -0.008495376             2.812053         2.069359
race_recode      8.554264 13.283747145            12.377256        10.561756
gender_recode    4.666134  0.609293550             4.473157         3.219151
edu_recode       8.794639  0.570691952             8.160257         5.389098
chd11dage       18.343714 10.974421772            21.020283        20.870940
martstat_recode  5.387864  3.179791268             6.181515         3.157529
anxiety_rf_fit |>
  extract_fit_engine() |>
  vip()

#20-fold cross validation on rf
anxiety_folds<-vfold_cv(anxiety_train, v=20)

rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(gad2_cg_cat ~ .)

rf_fit_cv_anxiety <- rf_workflow |>
  fit_resamples(resamples = anxiety_folds)

collect_metrics(rf_fit_cv_anxiety)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.833    20 0.0162  Preprocessor1_Model1
2 brier_class binary     0.108    20 0.00702 Preprocessor1_Model1
3 roc_auc     binary     0.916    20 0.0116  Preprocessor1_Model1
#roc_auc = 0.916

rf_wf_fit_cv_anxiety <- rf_workflow |>
  fit_resamples(
    resamples = anxiety_folds,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(rf_wf_fit_cv_anxiety) #roc_auc = 0.916
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.833    20 0.0133  Preprocessor1_Model1
2 brier_class binary     0.108    20 0.00712 Preprocessor1_Model1
3 roc_auc     binary     0.916    20 0.0121  Preprocessor1_Model1
rf_wf_fit_cv_anxiety |>
  collect_predictions() |>
  roc_curve(gad2_cg_cat, .pred_0) |>
  autoplot()

# Collect metrics from the resampling results
rf_wf_fit_cv_anxiety_metrics <- collect_metrics(rf_wf_fit_cv_anxiety)
#roc_auc = 0.916

# If you need predictions for further analysis
rf_wf_fit_cv_anxiety_preds <- collect_predictions(rf_wf_fit_cv_anxiety)

The random forest model demonstrated strong performance, with ROC-AUC values of approximately 0.91-0.92 across repeated cross-validation runs. These results indicate that the model effectively distinguishes between the classes, and the ROC curves further confirmed its good classification ability.

4.2.4.2 Depression

rf_spec<-rand_forest(trees=1000, min_n=5)|>
  set_engine("randomForest", importance=TRUE)|>
  set_mode("classification")

depression_rf_fit<-rf_spec|>
  fit(phq2_cg_cat ~ ., data=depression_train)

## top variables
depression_rf_fit |>
  extract_fit_engine() |>
  importance() |>
  as.data.frame() |>
  arrange(desc(MeanDecreaseGini))
                        0          1 MeanDecreaseAccuracy MeanDecreaseGini
chd11dage       12.404728 11.6399525           15.8382957        19.115136
cac11exhaustd   10.455619 24.1763284           20.1142534        18.813052
che11health     21.518333  8.5107774           22.0389435        13.480689
race_recode      9.337045 20.7765546           16.4775664        13.219742
che11sleepint   11.643240  8.8756520           14.3873522        11.857723
che11enrgylmt   18.200996 13.9651369           21.2909751         9.052639
cca11hwoftchs    9.229348 -1.4023941            7.8691197         8.254814
cac11diffemlv   13.910341  5.3995193           14.1620228         8.011183
cca11hwoftshp   11.860510  1.2594479           11.4930569         7.358412
cac11diffphy    14.205512  3.6435162           14.3581231         6.940735
edu_recode      10.407832  1.8302965           10.5024355         6.096042
cca11hwoftott   14.497526  3.7979911           14.5014590         3.912281
gender_recode   10.001609 -0.9933611            8.9000685         3.186318
martstat_recode  8.846703  1.2260871            8.6452001         3.154750
ew11progneed1    1.389528 -2.1601955           -0.3144413         1.848866
#20fold CV- rf
depression_folds<-vfold_cv(depression_train, v=20)
depression_rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(phq2_cg_cat ~ .)

depression_rf_fit_cv <- depression_rf_workflow |>
  fit_resamples(depression_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(depression_rf_fit_cv) 
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.807    20  0.0221 Preprocessor1_Model1
2 brier_class binary     0.122    20  0.0101 Preprocessor1_Model1
3 roc_auc     binary     0.881    20  0.0217 Preprocessor1_Model1
#roc_auc = 0.881

I created a random forest model with 1,000 trees and a minimum node size of 5 to classify depression using the phq2_cg_cat variable. After fitting the model to the training data, I extracted variable importance, using the MeanDecreaseGini measure to identify the most influential features.

For evaluation, I performed 20-fold cross-validation on the training data, resulting in a ROC AUC of 0.881, indicating good performance in classifying depression.
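The trees and min_n values above were fixed in advance rather than tuned; tidymodels also supports searching over hyperparameters with tune_grid(). A small sketch of that approach on a built-in dataset (iris reduced to two classes, standing in for the project data; the seed and grid size are arbitrary choices for illustration):

```r
library(tidymodels)

# Two-class toy data, standing in for the project training set
toy <- droplevels(iris[iris$Species != "setosa", ])

# Mark mtry and min_n for tuning instead of fixing them
rf_tune_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 200) |>
  set_engine("randomForest") |>
  set_mode("classification")

set.seed(503)  # hypothetical seed for reproducible folds
toy_folds <- vfold_cv(toy, v = 5)

rf_tune_res <- workflow() |>
  add_model(rf_tune_spec) |>
  add_formula(Species ~ .) |>
  tune_grid(resamples = toy_folds, grid = 4)

# Best hyperparameter combination by cross-validated ROC-AUC
select_best(rf_tune_res, metric = "roc_auc")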

depression_rf_cv_preds<-collect_predictions(depression_rf_fit_cv)

depression_rf_cv_preds|>
  group_by(id)|>
  roc_auc(truth = phq2_cg_cat, .pred_0)
# A tibble: 20 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.564
 2 Fold02 roc_auc binary         0.789
 3 Fold03 roc_auc binary         0.844
 4 Fold04 roc_auc binary         0.839
 5 Fold05 roc_auc binary         1    
 6 Fold06 roc_auc binary         0.892
 7 Fold07 roc_auc binary         0.902
 8 Fold08 roc_auc binary         0.892
 9 Fold09 roc_auc binary         0.843
10 Fold10 roc_auc binary         0.889
11 Fold11 roc_auc binary         0.95 
12 Fold12 roc_auc binary         0.786
13 Fold13 roc_auc binary         0.877
14 Fold14 roc_auc binary         0.914
15 Fold15 roc_auc binary         1    
16 Fold16 roc_auc binary         0.861
17 Fold17 roc_auc binary         0.886
18 Fold18 roc_auc binary         0.967
19 Fold19 roc_auc binary         0.965
20 Fold20 roc_auc binary         0.958
# Plot ROC curve
depression_rf_cv_preds |>
  group_by(id)|>
  roc_curve(truth = phq2_cg_cat, .pred_0) |>
  autoplot()

# Fit the random forest model on the full training data
depression_rf_fit <- rf_spec |>
  fit(phq2_cg_cat ~ ., data = depression_train)

depression_rf_fit
parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5,      x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 18.22%
Confusion matrix:
    0  1 class.error
0 302 31  0.09309309
1  51 66  0.43589744
#testing
depression_rf_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat,  # Actual values of the outcome variable
  predict(depression_rf_fit, new_data = depression_test),  # Predicted class labels
  predict(depression_rf_fit, new_data = depression_test, type = "prob")  # Predicted probabilities
)
roc_auc(depression_rf_pred_values,
        truth, 
        .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.882
#roc_auc = 0.882

autoplot(roc_curve(depression_rf_pred_values, 
                   truth, 
                   .pred_0))

I collected the predictions from the 20-fold cross-validation of the random forest model and calculated the ROC AUC for each fold, confirming the model’s performance. The ROC curve was plotted to visualize the classification ability.

After fitting the model on the full training data, I tested it on the separate test set. The model achieved a ROC AUC of 0.882 on the test data, demonstrating strong classification performance. The ROC curve for the test predictions further supported this finding.

depression_rf_fit |>
  extract_fit_engine() |>
  importance()
                        0          1 MeanDecreaseAccuracy MeanDecreaseGini
cca11hwoftchs    9.563661 -1.7962818             8.203725         7.822884
cca11hwoftshp   12.500052  0.3411634            12.150785         7.437493
cca11hwoftott   14.159072  2.9023277            13.452012         3.680144
che11enrgylmt   16.785016 14.3162529            20.239025         8.604121
cac11diffphy    14.671850  4.9301150            14.988892         7.329352
cac11exhaustd   12.251929 24.6368474            21.068868        20.069255
cac11diffemlv   14.793970  7.6083224            16.205137         7.940290
che11health     22.052720  6.0773495            22.220091        13.453691
che11sleepint   11.066770  8.7171100            13.721008        10.560901
ew11progneed1    2.597015 -0.4566291             1.497839         1.953672
race_recode      9.273959 22.1873500            16.982229        13.432903
gender_recode    9.117948 -1.6028781             7.961775         3.337548
edu_recode       9.953115  0.5429626             9.585333         6.140666
chd11dage       13.096725 12.2668800            16.588852        18.978122
martstat_recode  7.514458  0.6857396             7.185008         3.211368
depression_rf_fit |>
  extract_fit_engine() |>
  vip()

rf_spec_depression<-rand_forest(trees=1000, min_n=5)|>
  set_engine("randomForest", importance=TRUE)|>
  set_mode("classification")

rf_fit_depression<-rf_spec_depression|>
  fit(phq2_cg_cat ~ ., data=depression_train)


#20-fold cross validation on rf
depression_folds<-vfold_cv(depression_train, v=20)

rf_workflow_depression <- workflow() |>
  add_model(rf_spec_depression) |>
  add_formula(phq2_cg_cat ~ .)

rf_fit_cv_depression <- rf_workflow_depression |>
  fit_resamples(resamples = depression_folds)

collect_metrics(rf_fit_cv_depression)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.807    20 0.0168  Preprocessor1_Model1
2 brier_class binary     0.119    20 0.00770 Preprocessor1_Model1
3 roc_auc     binary     0.880    20 0.0185  Preprocessor1_Model1
#roc_auc = 0.880

rf_wf_fit_cv_depression <- rf_workflow_depression |>
  fit_resamples(
    resamples = depression_folds,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(rf_wf_fit_cv_depression) #roc_auc = 0.878
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.809    20 0.0168  Preprocessor1_Model1
2 brier_class binary     0.119    20 0.00777 Preprocessor1_Model1
3 roc_auc     binary     0.878    20 0.0182  Preprocessor1_Model1
rf_wf_fit_cv_depression |>
  collect_predictions() |>
  roc_curve(phq2_cg_cat, .pred_0) |>
  autoplot()


# If you need predictions for further analysis
rf_wf_fit_cv_depression_preds <- collect_predictions(rf_wf_fit_cv_depression)

The ROC AUC for the cross-validation results was approximately 0.88, indicating good model performance. After including the save_pred = TRUE control option, which stores the predictions for each fold, the recalculated ROC AUC remained approximately 0.88.
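The small differences between repeated cross-validation runs arise because vfold_cv() draws a new random fold assignment each time it is called; seeding the random number generator before resampling makes the folds, and hence the resulting metrics, reproducible. A self-contained sketch on mtcars (the seed value is arbitrary):

```r
library(rsample)

# Identical calls give different folds unless the RNG is seeded first
set.seed(600)  # hypothetical seed for illustration
folds_a <- vfold_cv(mtcars, v = 5)
set.seed(600)
folds_b <- vfold_cv(mtcars, v = 5)

# With the same seed, both runs assign rows to the same folds
in_ids <- function(folds) lapply(folds$splits, function(s) s$in_id)
identical(in_ids(folds_a), in_ids(folds_b))  # TRUE
```

Seeding once before each vfold_cv() call in this project would make the anxiety and depression cross-validation AUCs stable across re-renders of the document.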

4.2.4.3 Graph

Anxiety

# ROC curve for GLM training-set predictions
anxiety_roc_glm_training <- anxiety_glm_pred_values |> roc_curve(truth, .pred_0)
# ROC curve for GLM test-set predictions
anxiety_roc_glm_cv <- anxiety.lr.pred.values.test |> roc_curve(truth, .pred_0)
# ROC curve for RF test-set predictions
anxiety_roc_rf_training <- anxiety_rf_pred_values |> roc_curve(truth, .pred_0)
# ROC curve for RF 20-fold cross-validation predictions
anxiety_roc_rf_cv <- rf_wf_fit_cv_anxiety_preds |>
  roc_curve(truth = gad2_cg_cat, .pred_0)


ggplot() +
  geom_path(data = anxiety_roc_glm_training, aes(x = 1 - specificity, y = sensitivity), 
            color = "blue", linetype = "solid", linewidth = 1) +
  geom_path(data = anxiety_roc_glm_cv, aes(x = 1 - specificity, y = sensitivity), 
            color = "green", linetype = "dashed", linewidth = 1) +
  geom_path(data = anxiety_roc_rf_training, aes(x = 1 - specificity, y = sensitivity), 
            color = "red", linetype = "solid", linewidth = 1) +
  geom_path(data = anxiety_roc_rf_cv, aes(x = 1 - specificity, y = sensitivity), 
            color = "purple", linetype = "dashed", linewidth = 1) +
  coord_equal() +
  theme_bw() +
  labs(title = "Anxiety ROC Curves for Logistic Regression and Random Forest",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (True Positive Rate)") +
  annotate("text", x = 0.7, y = 0.2, label = "Logistic Regression (Training)", 
           color = "blue", size = 2) +
  annotate("text", x = 0.7, y = 0.15, label = "Logistic Regression (Test)", 
           color = "green", size = 2) +
  annotate("text", x = 0.7, y = 0.1, label = "Random Forest (Test)", 
           color = "red", size = 2) +
  annotate("text", x = 0.7, y = 0.05, label = "Random Forest (20-fold CV)", 
           color = "purple", size = 2)

Depression

library(yardstick)

# ROC curve for GLM training-set predictions
depression_roc_glm_training <- depression_glm_pred_values |>
  roc_curve(truth = truth, .pred_0)

# ROC curve for GLM test-set predictions
depression_roc_glm_cv <- depression.lr.pred.values.test |>
  roc_curve(truth = truth, .pred_0)

# ROC curve for RF test-set predictions
depression_roc_rf_training <- depression_rf_pred_values |>
  roc_curve(truth = truth, .pred_0)

# ROC curve for RF 20-fold cross-validation predictions
depression_roc_rf_cv <- depression_rf_cv_preds |>
  roc_curve(truth = phq2_cg_cat, .pred_0)


ggplot() +
  geom_path(data = depression_roc_glm_training, aes(x = 1 - specificity, y = sensitivity), 
            color = "blue", linetype = "solid", linewidth = 1) +
  geom_path(data = depression_roc_glm_cv, aes(x = 1 - specificity, y = sensitivity), 
            color = "green", linetype = "dashed", linewidth = 1) +
  geom_path(data = depression_roc_rf_training, aes(x = 1 - specificity, y = sensitivity), 
            color = "red", linetype = "solid", linewidth = 1) +
  geom_path(data = depression_roc_rf_cv, aes(x = 1 - specificity, y = sensitivity), 
            color = "purple", linetype = "dashed", linewidth = 1) +
  coord_equal() +
  theme_bw() +
  labs(title = "Depression ROC Curves for Logistic Regression and Random Forest",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (True Positive Rate)") +
  annotate("text", x = 0.5, y = 0.2, label = "Logistic Regression (Training)", 
           color = "blue", size = 2) +
  annotate("text", x = 0.5, y = 0.15, label = "Logistic Regression (Test)", 
           color = "green", size = 2) +
  annotate("text", x = 0.5, y = 0.1, label = "Random Forest (Test)", 
           color = "red", size = 2) +
  annotate("text", x = 0.5, y = 0.05, label = "Random Forest (20-fold CV)", 
           color = "purple", size = 2)

Limitations:
Despite the robust analysis and promising results, there are several limitations to consider. First, the sample size for both the training and testing sets may not fully capture the diversity of the target population, potentially limiting the generalizability of the findings. Second, while cross-validation helps to reduce overfitting, it does not account for all potential biases or variations within different subgroups of the data. Third, missing data, if not adequately handled, could have introduced bias or reduced the accuracy of model predictions. Furthermore, the models used are based on observed data and may not fully capture complex relationships or interactions that could emerge with a broader set of variables or external factors. Lastly, the reliance on specific features may limit the scope of the model’s performance in more dynamic or varied real-world scenarios.

4.3 Conclusion

In conclusion, we performed 20-fold cross-validation on both the anxiety and depression datasets using GLM and random forest models. Both outcomes achieved high AUC values, indicating strong model performance, and the variable importance analysis highlighted the key features driving the predictions. The ROC curve plots demonstrated good classification accuracy for both the anxiety (AUC ≈ 0.91) and depression (AUC ≈ 0.88) models. These results suggest that random forest models are effective in predicting caregiver anxiety and depression from the available features.